\chapter{Description of the provided documents}

In this chapter we give an overview of the documents that we received from the CEZ, a.s.\ company. Specifically, we describe the scope of the documents, their size, structure and format. In the last section we try to identify and select the most interesting ones, on which the research in this thesis will be based.

\section{Documents}

The documents are divided into 13 sets. Each set contains various types of company documentation: presentations, reports, records, employee profiles and project documentation. The following subsections describe each set.

\subsection{Field of experience summary}

This set contains documents that summarize knowledge management, and it consists of 32 files. Most of the documents are PowerPoint presentations. These presentations contain many tables, enumerations, short sentences and abbreviations. Some of them contain embedded Word and Excel documents. The other documents in this set were probably exported from a database and contain descriptions of experience gained from various events. Each of these documents contains more than one such event. Most of the presentations have more than 20 slides. The length of the embedded Word and Excel documents varies from 1 to 10 pages.

\subsection{Experience records}
\label{sec:Experience}

This is one of the largest sets, with about 250 files. Most of the documents describe individual records of experience, observations and research. The first page summarizes the document in a table. The table contains the names of the authors and supervisors, a short description of the event, keywords, the date of origin and many other details. The text is structured into sections with long continuous text, tables and enumerations. All these documents were created in Microsoft Word and most of them have more than 5 pages. Some of the documents are also in Portable Document Format \emph{(PDF)} and were scanned from printed versions of these Word files. The other documents in this set are various forms and Excel tables with names, numbers, abbreviations and phrases.

\subsection{Expert's profiles}

All documents are PowerPoint presentations that are one or two slides long. Each slide consists of one large table containing an employee's profile, probably imported from a database. The profiles store information about the employee's name, ID, position, experience, role, education, references and many other details. There are 13 documents in this set.

\subsection{Substitution program}

This set of documents concerns the internship program within the company. The documents report on the transfer of expertise and experience between employees. They are divided into groups of Word documents, Excel spreadsheets and PDF files. The Word and PDF files mostly have a structured introduction describing the goals of the educational program, followed by a detailed report spanning multiple pages. The Excel spreadsheets are forms with predefined slots filled with a description of the program event. There are 59 documents in this set.

\subsection{Knowledge management concept}

This set has only one file, a PowerPoint presentation exported to PDF. The document has 124 slides with many tables, short indented texts and images. Its purpose is to present the knowledge management concept.

\subsection{Forms}

The documents are a combination of forms and templates for writing reports and recording events of various projects. The set is a mixture of Word documents and Excel spreadsheets with short coherent texts, structured into tables with pre-filled example values. The set has 17 files.

\subsection{Yearly reports}

This set contains presentations, transcripts and documentation of annual project reports. The presentations are filled with tables, images and itemized lists with short sentences. The transcripts contain structured lists of goals with short descriptions. The documentation is mostly divided into sections with longer structured continuous text spanning several pages. There are 10 files in this set.

\subsection{General presentations}

This set contains PowerPoint presentations of various projects, events and training courses. The structure and length of these presentations vary from document to document. They consist of tables, itemized lists and images, with no longer continuous text.

\subsection{Reports}

This set contains 3 Excel spreadsheet reports. They consist of tables with dates, names, values and phrases. No continuous text is present.

\subsection{Experience records from KM}

These experience records are the same kind of documents as those in section \ref{sec:Experience}, but they come from a different source. All documents are in PDF format, exported from Word documents. The set contains 9 large documents.

\subsection{Management documentation}

This set contains documentation of various projects written for the needs of management. These are textual documents with a summary on the first page followed by sections with longer continuous text. The sections also contain tables and itemized lists with texts of various lengths.

\subsection{General enterprise documentation}

This set of documents is a mix of presentations, educational texts and tests used to prepare employees, suppliers and emergency staff to work in restricted areas of the company. The set contains about 300 files. The documents differ widely in content, size and structure.

\subsection{Supplier's training}

This set contains presentations about training programs that enable suppliers to work at the CEZ company. The use of tables, itemized lists and images is typical for this type of document. There are about 50 files in this set.

\section{Summary}

The received files represent typical everyday company documents. The common file types are Word, Excel, PowerPoint and PDF. Some of the PDF files were printed from other documents; others, however, were scanned, which will make text extraction difficult. The purpose of this thesis is to test the feasibility of information extraction, natural language processing and semantic search. This implies that we need to create a representative set from these documents. Therefore we need to cover not only the different file types, such as .doc, .pdf and .ppt, but also the different kinds of content. We need a set containing documents with long continuous texts, short itemized texts, tables, and texts with abbreviations and phrases. This set will be created from the following documents:

\begin{itemize}
	\item \textbf{Experience records} -- semi-structured long texts with tables, abbreviations and values. Different domains of information. Word and PDF file types.
	\item \textbf{Expert's profiles} -- structured PowerPoint files with short itemized texts and abbreviations.
	\item \textbf{Substitution program} -- reports, different file types, different domains, longer texts.
	\item \textbf{Management documentation} -- project documentation, longer texts, different domains.
\end{itemize}

In the rest of this thesis we will therefore work only with these documents.